
Conversation


@mludvig mludvig commented Jun 10, 2025

I was trying to generate a testset from KnowledgeGraph(nodes: 1861, relationships: 912103) and it would run for two days straight without any progress. So I did some profiling on a smaller KG to see what was going on, and found that most of the time was being spent in find_indirect_clusters().

Here is an optimised implementation that seems to produce similar results but is way faster. On my reduced-size KG I'm down from 5 min to a fraction of a second, and on the full KG from never finishing to a few seconds.
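The diff itself isn't shown in this thread, but going by the description later in the thread (the fix replaces the exhaustive DFS path enumeration with a BFS that grows disjoint clusters from each unvisited node), a minimal pure-Python sketch of that idea might look like the following. All names here are illustrative, not the actual Ragas API:

```python
from collections import deque

def find_clusters_bfs(adjacency, depth_limit=3):
    """Group nodes into disjoint clusters via BFS, bounded by depth_limit.

    Each node is visited at most once, so the whole pass is roughly
    O(nodes + edges), instead of enumerating every path the way an
    exhaustive DFS would.
    """
    clusters = []
    visited = set()
    for start in adjacency:
        if start in visited:
            continue
        cluster = {start}
        visited.add(start)
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth >= depth_limit:
                continue
            for neighbor in adjacency.get(node, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    cluster.add(neighbor)
                    queue.append((neighbor, depth + 1))
        # Only keep clusters with at least one relationship.
        if len(cluster) > 1:
            clusters.append(cluster)
    return clusters

# Tiny example graph with two disconnected components.
graph = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "x": ["y"], "y": ["x"],
}
print(find_clusters_bfs(graph))  # two disjoint clusters
```

Because each node lands in at most one cluster, this trades away the original behaviour of finding every overlapping cluster in exchange for a single linear pass over the graph.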

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Jun 10, 2025
@Lukevanl

Ran into the exact same issue and indeed made a similar fix. It would be nice to get this merged, since the current code is only usable on very small graphs :)

@anistark
Contributor

anistark commented Aug 28, 2025

Thanks for the PR @mludvig 🙌🏼 It does seemingly fix a critical issue.

Could you please add a test as well? That way we can validate this works, and it can also serve as our north star going forward.

Also, please rebase and check on the merge conflict.

@jjmachan
Member

Hey @mludvig! 👋

Hope you're doing well! I really loved your work on reducing find_indirect_clusters() complexity from O(exp(n)) to O(n) - it's exactly the kind of thoughtful improvement that makes Ragas better for everyone.

Quick question for you - we're trying to figure out what to do with the Testset Generation module as we gear up for v0.4, and since you've been working in this space, I'd love to get your take on it.

Mind checking out this discussion when you have a moment?
🔗 #2231

Basically we're wondering if we should keep it as part of the core library, spin it off into its own thing, or maybe even retire it if folks aren't really using it much. No pressure at all, but given your experience with algorithm optimization and performance improvements, your perspective would be super helpful!

Just drop a 👍 👎 or 🚀 on the issue, or feel free to share any thoughts you have.

Thanks for being awesome! 🙏

@anistark
Contributor

anistark commented Sep 1, 2025

Hi @mludvig

Have you also checked the difference in results between the old and new approaches?

I'd recommend adding a new function rather than replacing the existing one, so existing users don't break, while still providing the alternate hop-based algorithm you've suggested.

anistark pushed a commit that referenced this pull request Sep 8, 2025
…n and sampling (#2144)

Like @mludvig found in #2071, I noticed that find_indirect_clusters' use
of exhaustive depth-first search is a significant barrier to use on
any even moderately substantial knowledge graph. That PR uses BFS to
identify a set of disjoint clusters involving the source node (each node
appears in at most one cluster) whereas the original
find_indirect_clusters identifies all sets of clusters up to length
depth_limit from each node. @mludvig, if I'm out of line here, please
correct me!

The approach in this PR instead identifies neighborhoods in the graph
using a Leiden clustering algorithm and samples from the neighborhoods.
I believe this to be a better approach - in my testing it is even faster
than BFS, and the function returns something more in line with the
original `find_indirect_clusters` implementation.

I would have preferred (and, in fact, originally tried) using a Clique
Percolation Method approach because it allows nodes to belong to
multiple neighborhoods; however CPM also ends up running into NP-hard /
NP-complete runtime issues.

This PR does add two dependencies to Ragas - networkx and
scikit-network.
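The "detect neighborhoods, then sample from them" idea can be sketched without the scikit-network dependency. The following pure-Python illustration substitutes a simple label-propagation pass for the actual Leiden algorithm used in the PR; all function names and parameters here are hypothetical stand-ins, not the merged implementation:

```python
import random

def label_propagation(adjacency, rounds=20, seed=0):
    """Crude community detection: each node repeatedly adopts the most
    common label among its neighbors. A toy stand-in for Leiden clustering.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}
    nodes = list(adjacency)
    for _ in range(rounds):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            neighbors = adjacency.get(node, ())
            if not neighbors:
                continue
            counts = {}
            for nb in neighbors:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            best = max(counts, key=counts.get)
            if labels[node] != best:
                labels[node] = best
                changed = True
        if not changed:
            break  # converged
    # Group nodes that ended up with the same label.
    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return list(communities.values())

def sample_clusters(communities, k=2, seed=0):
    """Sample up to k node-sets from the detected neighborhoods."""
    rng = random.Random(seed)
    return rng.sample(communities, min(k, len(communities)))

# Example: two disconnected triangles form two neighborhoods.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}
neighborhoods = label_propagation(graph)
picked = sample_clusters(neighborhoods, k=1)
```

The appeal of the clustering-then-sampling shape is that the expensive step runs once over the whole graph, after which drawing clusters for testset generation is cheap.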
@anistark
Contributor

anistark commented Sep 9, 2025

Closing this as #2144 covered the changes.

@anistark anistark closed this Sep 9, 2025